Course Project Notebook: Analysis of Crunchbase

Executive Summary

Our main goal is to study trends and patterns among established companies and emerging startups in tech. Startup culture has led to enormous growth in the tech industry and generated some of the most innovative products and services in history.

We collected data from Crunchbase, a website that lists businesses and startups in the tech industry and provides information on their categories, investments, funding rounds, acquisitions and IPOs, among many other features.

We aim to cluster VC and investor groups by category, and to predict funding potential for upcoming startups and investment opportunities for investors. We will also use various models for feature selection to predict the aforementioned characteristics. Overall, our objective is to gain both a startup-side and a VC-side perspective on funding from Crunchbase data.

We plan to use k-means and latent semantic indexing (LSI) to cluster investors, and to try a variety of regression approaches such as linear regression, GLMs or GAMs to estimate funding and investment potential.
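
As a rough illustration of the clustering plan, the sketch below applies an LSI-style truncated SVD followed by k-means to a small investor-by-category count matrix. The investor names, categories and counts are invented for illustration and are not taken from the Crunchbase data; the number of latent dimensions and clusters are likewise assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy data: rows are investors, columns are company categories,
# entries count how many investments each investor made in each category.
investor_names = ["inv_a", "inv_b", "inv_c", "inv_d", "inv_e", "inv_f"]
categories = ["web", "mobile", "software", "biotech"]
counts = np.array([
    [9, 7, 5, 0],
    [8, 6, 6, 0],
    [7, 9, 4, 1],
    [0, 1, 0, 9],
    [1, 0, 0, 8],
    [0, 0, 1, 7],
], dtype=float)

# LSI step: truncated SVD projects each investor onto 2 latent dimensions.
lsi = TruncatedSVD(n_components=2, random_state=0)
latent = lsi.fit_transform(counts)

# k-means on the latent coordinates groups investors with similar portfolios.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(latent)

for name, label in zip(investor_names, labels):
    print(name, label)
```

With data this cleanly separated, the three tech-heavy investors should land in one cluster and the three biotech-heavy investors in the other; on real data the number of clusters would have to be chosen by inspection or a validity index.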

Team Members

Ansuya Ahluwalia (ansuya@cs.ucla.edu) -- exploratory data analysis, graph mining

Ashwini Bhatkhande (ash@cs.ucla.edu) -- machine learning, production of final notebook

Raghav Mehrish (rmehrish@ucla.edu) -- data extraction, machine learning

Shivin Kapur (shivinkapur@ucla.edu) -- data modeling, production of final notebook

Data

  • Companies
  • Acquisitions
  • Rounds
  • Investments

Tools and Packages

  • Exploratory Data Analysis
    • Python : pandas, numpy, matplotlib.pyplot
    • R : data.table, plyr, ggplot2, reshape2

  • Data Modeling
    • R : Rmisc, MASS, plyr, splines, data.table

  • Predictive Modeling
    • Python : pandas, numpy, sklearn, pylab
    • R : caret (including its knn3 function), kernlab, plyr, e1071, MASS, nnet

  • Graph Mining
    • R : igraph, Matrix

Exploratory Data Analysis

In [1]:
%load_ext rmagic
import rpy2 as Rpy
In [20]:
#Histogram of company categories
/Users/Work/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1070: DtypeWarning: Columns (9) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

Out[20]:
<matplotlib.axes.AxesSubplot at 0x108d38d50>
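
The DtypeWarning above is raised when pandas infers different types for different chunks of a column. A minimal sketch of one way to avoid it and then count companies per category follows; the inline CSV, its column names and the "-" placeholder for missing funding are assumptions for illustration, since the real file is not shown in this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for the companies CSV; the real file layout is not shown here.
csv_text = u"""name,category_code,funding_total_usd
a,web,1000
b,web,-
c,mobile,2500
d,software,300
e,web,1200
"""

# Reading the ambiguous column as a string avoids mixed-type inference;
# pd.to_numeric then turns unparseable entries into NaN.
companies = pd.read_csv(io.StringIO(csv_text), dtype={"funding_total_usd": str})
companies["funding_total_usd"] = pd.to_numeric(
    companies["funding_total_usd"], errors="coerce")

# Number of companies per category, largest first.
category_counts = companies["category_code"].value_counts()
print(category_counts)

# In the notebook, the histogram could then be drawn with:
# category_counts.plot(kind="bar")
```

Passing low_memory=False, as the warning itself suggests, is the other common option; declaring the dtype explicitly has the advantage of making the cleaning step visible.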
In [21]:
#Total Funding per Company Category
Out[21]:
<matplotlib.legend.Legend at 0x107b619d0>
In [24]:
#Frequency of Status
Out[24]:
<matplotlib.axes.AxesSubplot at 0x10672cd50>
In [23]:
#Histogram of founded year
Out[23]:
<matplotlib.axes.AxesSubplot at 0x1066b0550>
In [26]:
#Total funding with funding quarter and no. of companies
In [7]:
 
      company_category_code company_country_code funding_round_type funded_year total_amount
11635                   web                  USA            venture        2010    244010490
11636                   web                  USA            venture        2011    416615077
11637                   web                  USA            venture        2012    191888230
11638                   web                  USA            venture        2013    121948356
11639                   web                  USA            venture        2014    170653603
11640                   web                  ZAF           series-a        2012       600000
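
A table of this shape can be produced by summing raised amounts over category, country, round type and year. The sketch below shows the idea with pandas on invented rows; the raised_amount_usd column name is borrowed from the merged table later in this notebook, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical funding rounds; grouping columns follow the table above.
rounds = pd.DataFrame({
    "company_category_code": ["web", "web", "web", "mobile"],
    "company_country_code": ["USA", "USA", "ZAF", "USA"],
    "funding_round_type": ["venture", "venture", "series-a", "angel"],
    "funded_year": [2010, 2010, 2012, 2011],
    "raised_amount_usd": [1000000, 2500000, 600000, 50000],
})

keys = ["company_category_code", "company_country_code",
        "funding_round_type", "funded_year"]

# Sum the raised amounts within each (category, country, round type, year) group.
totals = (rounds.groupby(keys, as_index=False)["raised_amount_usd"].sum()
                .rename(columns={"raised_amount_usd": "total_amount"}))
print(totals)
```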

In [16]:
# Problems faced during feature extraction
In [13]:
# Merged table with important data
 [1] "name"                   "permalink"              "homepage_url"          
 [4] "category_code"          "funding_total_usd"      "status"                
 [7] "country_code"           "state_code"             "region"                
[10] "city"                   "funding_rounds"         "founded_at"            
[13] "founded_year"           "investor_permalink"     "investor_name"         
[16] "investor_category_code" "investor_country_code"  "investor_state_code"   
[19] "investor_region"        "investor_city"          "funding_round_type.y"  
[22] "funded_at.y"            "funded_month.y"         "funded_quarter.y"      
[25] "funded_year.y"          "raised_amount_usd.y"    "quarters"              
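
The merged table combines company attributes with one row per investment. A minimal pandas sketch of such a join is shown below, assuming the two tables share the company permalink as a key; the company_permalink column name and all rows are hypothetical, and the notebook's own merge was done in R.

```python
import pandas as pd

# Hypothetical company and investment tables keyed by the company permalink.
companies = pd.DataFrame({
    "permalink": ["/company/a", "/company/b"],
    "name": ["A", "B"],
    "category_code": ["web", "mobile"],
})
investments = pd.DataFrame({
    "company_permalink": ["/company/a", "/company/a", "/company/b"],
    "investor_name": ["Inv 1", "Inv 2", "Inv 1"],
    "funding_round_type": ["series-a", "series-b", "venture"],
    "funded_year": [2011, 2012, 2013],
})

# Inner join: one output row per (company, investment) pair.
merged = companies.merge(investments, left_on="permalink",
                         right_on="company_permalink", how="inner")
merged = merged.drop(columns=["company_permalink"])
print(merged)
```

Because a company appears once per investment, company-level fields such as category_code are repeated across rows, which matters when counting companies rather than investments.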

In [14]:
# All important features
 [1] "category_code"          "status"                 "country_code"          
 [4] "funding_rounds"         "founded_year"           "investor_category_code"
 [7] "investor_country_code"  "funding_round_type.y"   "funded_year.y"         
[10] "quarters"               "labelNum"              

Predictive Modeling

In []:
%%R

# Directory holding the train/test CSV splits (machine-specific path)
dataDir <- '/Users/raghav297/dropbox/Documents/UCLA/UCLA_Spring_14/CS249/Crunchbase_Data'

# Descriptors are the feature matrices; classes are the target labels
trainDescr <- read.csv(file.path(dataDir, 'trainDescr.csv'))
trainClass <- read.csv(file.path(dataDir, 'trainClass.csv'))
testDescr <- read.csv(file.path(dataDir, 'testDescr.csv'))
testClass <- read.csv(file.path(dataDir, 'testClass.csv'))

# Drop the row-index column "X" that read.csv creates from an unnamed first CSV column
trainDescr$X <- NULL
trainClass$X <- NULL
testDescr$X <- NULL
testClass$X <- NULL
In [4]:
# Column names of training and testing descriptors and targets
# Dimensions of training and testing descriptors and targets
 [1] "category_codeadvertising"      "category_codebiotech"         
 [3] "category_codeecommerce"        "category_codeenterprise"      
 [5] "category_codemobile"           "category_codesoftware"        
 [7] "category_codeweb"              "statusoperating"              
 [9] "country_codeUSA"               "funding_rounds"               
[11] "founded_year"                  "investor_country_codeUSA"     
[13] "funding_round_type.yseries.a"  "funding_round_type.yseries.b" 
[15] "funding_round_type.yseries.c." "funding_round_type.yventure"  
[17] "funded_year.y"                 "quartersQ2"                   
[19] "quartersQ3"                    "quartersQ4"                   
 [1] "category_codeadvertising"      "category_codebiotech"         
 [3] "category_codeecommerce"        "category_codeenterprise"      
 [5] "category_codemobile"           "category_codesoftware"        
 [7] "category_codeweb"              "statusoperating"              
 [9] "country_codeUSA"               "funding_rounds"               
[11] "founded_year"                  "investor_country_codeUSA"     
[13] "funding_round_type.yseries.a"  "funding_round_type.yseries.b" 
[15] "funding_round_type.yseries.c." "funding_round_type.yventure"  
[17] "funded_year.y"                 "quartersQ2"                   
[19] "quartersQ3"                    "quartersQ4"                   
[1] "x"
[1] "x"
[1] 164606     20
[1] 54865    20
[1] 164606      1
[1] 54865     1
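
The column names above (for example category_codeweb and quartersQ2, with no quartersQ1) are what dummy coding with a dropped baseline level looks like, and the row counts (164606 training, 54865 test) correspond to roughly a 75/25 split. The sketch below reproduces that shape in pandas and scikit-learn on invented rows; it is an illustration under those assumptions, not the preprocessing actually used, which was done in R.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature rows using a few of the column names listed above.
features = pd.DataFrame({
    "category_code": ["web", "mobile", "web", "software",
                      "mobile", "web", "software", "web"],
    "quarters": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "funding_rounds": [1, 2, 3, 1, 2, 4, 1, 2],
    "founded_year": [2005, 2008, 2010, 2011, 2007, 2009, 2012, 2006],
})
labels = pd.Series([1, 2, 3, 1, 2, 3, 1, 2], name="x")

# Dummy-code the categorical columns; prefix_sep="" yields names such as
# "quartersQ2", and drop_first=True removes the baseline level of each factor.
descr = pd.get_dummies(features, columns=["category_code", "quarters"],
                       prefix_sep="", drop_first=True)

# Hold out 25% of the rows for testing.
train_descr, test_descr, train_class, test_class = train_test_split(
    descr, labels, test_size=0.25, random_state=0)

print(list(descr.columns))
print(train_descr.shape, test_descr.shape)
```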

In [5]:
# Sample output for one of the classifiers
['category_codeadvertising' 'category_codebiotech' 'category_codeecommerce'
 'category_codeenterprise' 'category_codemobile' 'category_codesoftware'
 'category_codeweb' 'statusoperating' 'country_codeUSA' 'funding_rounds'
 'founded_year' 'investor_country_codeUSA' 'funding_round_type.yseries-a'
 'funding_round_type.yseries-b' 'funding_round_type.yseries-c+'
 'funding_round_type.yventure' 'funded_year.y' 'quartersQ2' 'quartersQ3'
 'quartersQ4']
['category_codeadvertising' 'category_codebiotech' 'category_codeecommerce'
 'category_codeenterprise' 'category_codemobile' 'category_codesoftware'
 'category_codeweb' 'statusoperating' 'country_codeUSA' 'funding_rounds'
 'founded_year' 'investor_country_codeUSA' 'funding_round_type.yseries-a'
 'funding_round_type.yseries-b' 'funding_round_type.yseries-c+'
 'funding_round_type.yventure' 'funded_year.y' 'quartersQ2' 'quartersQ3'
 'quartersQ4']
Random Forest
Pred  [1 5 5 ..., 2 2 1]
Mean : 0.909742
Feature Importances  [ 0.01507738  0.01305174  0.01320477  0.01932507  0.01867471  0.02129487
  0.0117261   0.03326322  0.02709368  0.19284599  0.22825277  0.02048339
  0.03373769  0.03094636  0.04350878  0.0150039   0.16409708  0.03255682
  0.03360408  0.03225159]
Predict Probability  [[ 0.82431818  0.17568182  0.         ...,  0.          0.          0.        ]
 [ 0.6         0.275       0.025      ...,  0.          0.          0.        ]
 [ 0.59044289  0.33256022  0.07699689 ...,  0.          0.          0.        ]
 ..., 
 [ 0.          0.          0.         ...,  0.          0.          1.        ]
 [ 0.          0.          0.         ...,  0.          0.          1.        ]
 [ 0.          0.          0.         ...,  0.          0.          1.        ]]
Transform  [[-1.37108833  1.28185539  0.72992665]
 [-1.37108833  0.03388946 -1.4315749 ]
 [-1.37108833  1.28185539  1.0901769 ]
 ..., 
 [ 1.72109748  0.24188378  0.00942613]
 [ 1.72109748  0.24188378  0.00942613]
 [ 1.72109748  0.24188378  0.36967639]]
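
The quantities printed above map onto standard scikit-learn calls. The following is a minimal sketch on synthetic data, assuming a current scikit-learn release; the dataset, tree count and split are illustrative only. The three-column Transform output appears to be importance-based feature selection, which older scikit-learn versions exposed as a transform method on the forest itself; SelectFromModel is used here as the modern stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 20-column descriptor matrix (not Crunchbase data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

pred = forest.predict(X_test)                 # "Pred"
accuracy = forest.score(X_test, y_test)       # "Mean": mean accuracy on the test set
importances = forest.feature_importances_     # "Feature Importances"
proba = forest.predict_proba(X_test)          # "Predict Probability"

# Keep only the features whose importance is at least the mean importance.
selector = SelectFromModel(forest, prefit=True, threshold="mean")
reduced = selector.transform(X_test)          # analogous to "Transform"

print(accuracy, importances.round(3), reduced.shape)
```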

Classification Output

knn
Neighbors: 5, Accuracy: 0.837419
Trees
Mean : 0.910234
Classes [1 2 3 4 5 6 7]
['category_codeadvertising' 'category_codebiotech' 'category_codeecommerce'
 'category_codeenterprise' 'category_codemobile' 'category_codesoftware'
 'category_codeweb' 'statusoperating' 'country_codeUSA' 'funding_rounds'
 'founded_year' 'investor_country_codeUSA' 'funding_round_type.yseries-a'
 'funding_round_type.yseries-b' 'funding_round_type.yseries-c+'
 'funding_round_type.yventure' 'funded_year.y' 'quartersQ2' 'quartersQ3'
 'quartersQ4']
Feature Importances [ 0.01622837 0.01805052 0.01683086 0.02387598 0.02330386 0.02719037
  0.02007616 0.04184556 0.03291276 0.17633547 0.17953565 0.01676038
  0.02089646 0.03407675 0.05169245 0.02741202 0.143394 0.04255339
  0.04517388 0.04185513]
Parameters {'splitter': 'best', 'min_density': None, 'compute_importances': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'random_state': None, 'criterion': 'gini', 'max_features': None, 'max_depth': None}
SVM
Mean : 0.542404
Logistic Regression
Mean : 0.425043
Coefficients
[[ -6.74029896e-02 2.45308593e-02 -1.20782550e-01 -4.29517792e-03 -5.99620023e-02
   6.16633533e-02 3.19169638e-02 6.87422514e-02 -3.94510785e-01 -5.54229981e-02
   2.04825404e+00 1.26281471e-01 -2.24610020e+00 -1.84775494e+00 -2.19486223e+00
   -6.29456530e-01 -8.88031994e-01 -2.35726543e-02 -5.97043480e-02 -2.21975639e-01]
 [ -1.26700157e-02 2.52423338e-03 1.57914101e-02 -8.93324474e-03 -2.69554548e-02
   6.94913922e-02 7.32423720e-02 9.29939822e-03 -3.35939026e-02 -6.86815592e-02
   8.24484217e-01 -3.23513451e-01 -1.18995161e+00 -1.51903258e+00 -2.36856207e+00
   -5.09394917e-01 -5.16752080e-01 -7.00693680e-02 -5.49226988e-02 -1.77407424e-02]
 [ 2.55463677e-02 -1.34064882e-01 1.08793547e-02 8.92122227e-03 3.00940358e-02
   1.88527141e-02 7.66543137e-02 7.21465486e-02 1.52263435e-01 -1.62661012e-01
   4.94138193e-01 -2.04595507e-01 9.20764209e-02 -7.14374035e-01 -1.27932294e+00
   -4.05893725e-02 -1.02502073e-01 -5.30080087e-02 -3.55057424e-02 -2.70420186e-02]
 [ 1.05576633e-01 -8.34919369e-02 1.54682289e-03 3.40013585e-02 6.50640500e-02
   9.45736640e-02 3.55419178e-02 3.83016559e-02 -2.37572727e-02 -3.82860598e-02
   -8.50248509e-02 -1.10091834e-02 1.14921335e+00 6.12926381e-01 3.62936252e-01
   7.56788790e-01 3.89810771e-02 3.17920698e-02 -1.78148967e-02 3.28055652e-02]
 [ 8.30712109e-02 -1.04929319e-01 -3.79907045e-02 8.09280104e-02 1.30688033e-02
   7.81698781e-02 -5.48353945e-02 5.01371207e-02 1.21308618e-02 2.00486789e-02
   -2.22484585e-01 1.45890656e-01 8.93009082e-01 1.12428045e+00 8.60697927e-01
   7.84077446e-01 -4.71173596e-02 6.75062589e-02 4.68888942e-02 -3.01676225e-02]
 [ -1.56234462e-01 1.76853956e-01 -2.05927538e-02 -1.08639006e-01 -2.88818987e-02
   -1.75801892e-01 -1.48770292e-01 -7.22909434e-02 2.90718043e-02 7.60130247e-02
   -1.67084663e-01 1.70267681e-01 1.02718969e-01 7.21441498e-01 1.13186724e+00
   5.08726018e-01 6.14231827e-02 -2.10530515e-02 1.94825679e-02 5.39670210e-02]
 [ -2.54951191e-01 6.37519166e-02 1.11294565e-01 -6.34519717e-02 -2.72995877e-01
   -1.63254149e-01 -9.49132191e-02 -2.27037212e-01 -1.13123322e-01 3.90856987e-01
   -2.80016728e-01 2.25141408e-01 -1.04172375e+00 -4.36026802e-01 5.60266603e-02
   -6.60389386e-01 4.71304348e-01 -1.36097181e-02 1.00112396e-02 3.92822679e-02]]
Intercept [-6.63405632 -4.18993133 -2.49310703 -1.9064998 -1.32079602 -1.46560506 -3.37443161]
Confidence Score
[[ -3.80259573e-01 -1.09565235e-01 -5.75784442e-01 ..., -3.63514781e+00 -3.30702772e+00 -3.88546763e+00]
 [ -2.89550464e+00 -1.22741370e+00 -1.17327836e+00 ..., -1.34007815e+00 -1.65208832e+00 -5.28072960e+00]
 [ -8.81845417e-01 4.56090354e-03 -5.91749617e-01 ..., -3.98208617e+00 -3.31645315e+00 -2.22541171e+00]
 ...,
 [ -7.49659823e+00 -5.49987479e+00 -3.74457074e+00 ..., -1.53283205e-01 -4.37126050e-01 -2.18507251e+00]
 [ -7.49659823e+00 -5.49987479e+00 -3.74457074e+00 ..., -1.53283205e-01 -4.37126050e-01 -2.18507251e+00]
 [ -7.92889863e+00 -6.94638761e+00 -4.70463307e+00 ..., -1.30738280e+00 3.25655230e-01 -7.15082556e-01]]
Predict Probability
[[ 2.98871605e-01 3.47869784e-01 2.64896076e-01 ..., 1.89172074e-02 2.60042505e-02 1.48123317e-02]
 [ 4.70373046e-02 2.03532619e-01 2.12179652e-01 ..., 1.86346152e-01 1.44433150e-01 4.54686487e-03]
 [ 2.19531470e-01 3.75743638e-01 2.67095853e-01 ..., 1.37249491e-02 2.62504901e-02 7.30970078e-02]
 ...,
 [ 5.01491478e-04 3.68043384e-03 2.08852485e-02 ..., 4.17490380e-01 3.54808091e-01 9.14076781e-02]
 [ 5.01491478e-04 3.68043384e-03 2.08852485e-02 ..., 4.17490380e-01 3.54808091e-01 9.14076781e-02]
 [ 3.02645211e-04 8.07926187e-04 7.54148487e-03 ..., 1.78975688e-01 4.88112960e-01 2.76103409e-01]]
Transform
[[ 1.28185539 -0.54554616 -0.46470611 -0.59320985 -0.43459879 0.72992665]
 [ 0.03388946 -0.54554616 -0.46470611 -0.59320985 2.30095885 -1.4315749 ]
 [ 1.28185539 -0.54554616 -0.46470611 -0.59320985 -0.43459879 1.0901769 ]
 ...,
 [ 0.24188378 -0.54554616 2.15188462 -0.59320985 -0.43459879 0.00942613]
 [ 0.24188378 -0.54554616 2.15188462 -0.59320985 -0.43459879 0.00942613]
 [ 0.24188378 -0.54554616 -0.46470611 1.68573385 -0.43459879 0.36967639]]
Naive Bayes
Mean : 0.304602
Probability [ 0.03121393 0.07793762 0.15145256 0.16656136 0.25540989 0.24954133 0.06788331]
Mean of each feature per class
[[ -7.30472081e-02 -2.42751788e-01 3.09193182e-02 -7.09320325e-02 5.06484509e-02
   5.53919504e-02 2.99860734e-01 2.37510273e-01 -6.84308334e-01 -6.04511515e-01
   9.35049754e-01 -6.37359695e-01 -5.31658112e-01 -4.54011594e-01 -5.88330835e-01
   -2.36539728e-01 4.72324574e-01 -1.12942448e-02 5.06365802e-02 -1.20487318e-01]
 [ -2.78199574e-02 -1.94570885e-01 9.20310886e-02 -7.54475813e-02 3.72045676e-02
   3.22735018e-02 2.63956834e-01 1.28778288e-01 -3.31293761e-01 -4.49596953e-01
   6.59590982e-01 -6.88749844e-01 -3.36965564e-01 -4.25545968e-01 -5.85571330e-01
   -5.88834102e-02 2.83973389e-01 -7.47910826e-02 -6.68228817e-04 2.27835183e-02]
 [ 4.05268539e-02 -1.92112781e-01 5.34596619e-02 -1.40720514e-02 6.55814951e-02
   -1.50955713e-02 1.94155710e-01 1.58522504e-01 -4.21008768e-02 -3.85384334e-01
   5.36596632e-01 -3.60254010e-01 5.37352405e-01 -3.52821401e-01 -5.53262061e-01
   2.51680211e-02 2.53899211e-01 -5.09007881e-02 -9.24619657e-04 7.40553422e-03]
 [ 9.92603646e-02 -9.35198752e-02 -9.77075013e-03 1.38635760e-02 3.57736748e-02
   6.39704781e-02 2.56726942e-02 4.73172755e-02 -3.49057302e-02 -1.29087995e-01
   5.30828625e-02 -3.03795020e-02 5.95628451e-01 -5.54731107e-02 -3.31044463e-01
   1.48092851e-01 -4.06622797e-02 1.07546812e-02 -3.02852615e-02 3.04536058e-02]
 [ 5.96004767e-02 -4.04175487e-02 -5.84006036e-02 7.13052934e-02 -1.93649522e-02
   7.53243578e-02 -8.70167835e-02 -3.94584307e-02 7.81069853e-02 9.28309245e-02
   -2.40482771e-01 1.96454170e-01 -2.53315336e-02 3.66165075e-01 7.23397589e-03
   4.85256005e-02 -2.11966425e-01 4.93325840e-02 2.14113973e-02 -4.27514769e-02]
 [ -8.93619023e-02 2.47407841e-01 -3.95667905e-02 -3.38206464e-02 -2.16890736e-02
   -9.02896723e-02 -1.35832916e-01 -1.20257397e-01 1.28549843e-01 2.89344912e-01
   -3.48061361e-01 2.62250901e-01 -3.92673450e-01 1.25167059e-01 6.07402226e-01
   -3.89432226e-02 -1.18462885e-01 -8.45860900e-04 -4.60110564e-03 2.75226688e-02]
 [ -1.64187474e-01 2.35683260e-01 1.50002565e-01 -2.73416763e-02 -1.47506796e-01
   -1.37303980e-01 -1.10374469e-01 -1.36306070e-01 1.08165557e-01 5.57791940e-01
   -3.30370514e-01 2.58925088e-01 -4.90201103e-01 -2.17190767e-01 7.29409613e-01
   -2.82568959e-01 2.23080907e-01 -4.26666678e-03 -9.79051440e-03 -2.32288769e-03]]
Variance of each feature per class
[[ 0.72751922 0.27412012 1.12392991 0.78868296 1.15555181
   1.12120442 2.00215964 0.62179042 1.6669198 0.65600474
   0.29490454 0.97055848 0.0328407 0.02786879 0.0110952
   0.50257458 0.51803117 0.98651735 1.06008543 0.83914516]
 [ 0.89748042 0.42756474 1.36326415 0.77489017 1.11476144
   1.07136198 1.89164216 0.80893396 1.43982385 0.75382269
   0.62934614 0.93279015 0.45261571 0.10093254 0.01734941
   0.88662902 0.74717909 0.90600234 0.99916664 1.02714537]
 [ 1.14656108 0.43526874 1.21307489 0.95887254 1.20043673
   0.96589725 1.66940571 0.76008927 1.06806287 0.71842576
   0.69582424 1.0831846 1.40307042 0.2802383 0.08944294
   1.04633309 0.75574013 0.93724183 0.99884899 1.00893309]
 [ 1.35314349 0.73430794 0.9604316 1.04011882 1.11039866
   1.13942746 1.09283388 0.93364688 1.05668087 0.83603551
   0.91325897 1.01703065 1.41207333 0.90332362 0.52872947
   1.25445702 1.00809856 1.01258955 0.96160302 1.03605234]
 [ 1.21440487 0.88731566 0.76068652 1.20227458 0.93916204
   1.16331977 0.67550996 1.05189748 0.86432101 0.93769331
   0.98101408 0.84526064 0.9667387 1.48370291 1.00784489
   1.08820543 1.01982177 1.05586792 1.02602904 0.94624467]
 [ 0.66520557 1.61852087 0.83860679 0.90049524 0.93181078
   0.78926554 0.48684468 1.14846362 0.77021665 1.04692357
   0.79774726 0.77618188 0.34024693 1.19550631 1.29465797
   0.92579528 1.10920275 0.99899346 0.99427954 1.03266265]
 [ 0.37259117 1.59197236 1.58339652 0.9197332 0.51772327
   0.67308291 0.58583149 1.16608977 0.80885756 1.10951053
   1.13314928 0.77988136 0.1285785 0.5863825 1.26485305
   0.39277329 0.94742489 0.99493282 0.9877837 0.99716739]]
Predict Probability
[[ 9.78933322e-001 1.96678630e-002 1.37410903e-003 ..., 1.95204378e-006 1.13551826e-006 6.06241859e-009]
 [ 3.11526057e-002 8.30995038e-001 9.63635979e-002 ..., 6.27341074e-003 3.95799808e-003 7.07936703e-006]
 [ 8.65832248e-001 1.31709138e-001 2.28837644e-003 ..., 9.23904745e-008 2.34057676e-007 1.62951558e-004]
 ...,
 [ 4.14479493e-054 1.76284897e-015 1.56516839e-006 ..., 3.15076043e-001 6.26708426e-001 4.87674155e-002]
 [ 4.14479493e-054 1.76284897e-015 1.56516839e-006 ..., 3.15076043e-001 6.26708426e-001 4.87674155e-002]
 [ 2.37605421e-103 1.21535194e-066 8.70440996e-015 ..., 1.40258323e-002 2.73640432e-001 7.12103237e-001]]

Ensemble Methods

Random Forest
Pred [1 5 5 ..., 2 2 1]
Mean : 0.909469
Feature Importances [ 0.01574237 0.0152938 0.01492231 0.0184253 0.01671296 0.0227888
  0.01556403 0.02954133 0.02546658 0.1940718 0.22034108 0.01850768
  0.03806563 0.02549743 0.04026472 0.01668293 0.17064961 0.0338506
  0.03442094 0.03319011]
Predict Probability
[[ 0.74777778 0.25222222 0. ..., 0. 0. 0. ]
 [ 0.8 0. 0.2 ..., 0. 0. 0. ]
 [ 0.60727273 0.33687166 0.05585561 ..., 0. 0. 0. ]
 ...,
 [ 0. 0. 0. ..., 0. 0. 1. ]
 [ 0. 0. 0. ..., 0. 0. 1. ]
 [ 0. 0. 0. ..., 0. 0. 1. ]]
Transform
[[-1.37108833 1.28185539 0.72992665]
 [-1.37108833 0.03388946 -1.4315749 ]
 [-1.37108833 1.28185539 1.0901769 ]
 ...,
 [ 1.72109748 0.24188378 0.00942613]
 [ 1.72109748 0.24188378 0.00942613]
 [ 1.72109748 0.24188378 0.36967639]]
Adaboost
Pred [1 5 5 ..., 2 2 1]
Mean : 0.910289
Feature Importances [ 0.00976162 0.01121367 0.01210828 0.01586263 0.01547273 0.01839148
  0.00946322 0.02490826 0.02157679 0.17348857 0.25357233 0.02079125
  0.03249232 0.03153995 0.04896464 0.01553601 0.19668538 0.02954983
  0.03039174 0.0282293 ]
Extra Trees
Pred [1 5 5 ..., 2 2 1]
Mean : 0.909833
Feature Importances [ 0.01114193 0.01207747 0.01244962 0.01568419 0.01520974 0.01847939
  0.01000801 0.02658796 0.01330737 0.18679507 0.25358345 0.01780327
  0.02927668 0.0283059 0.05228376 0.02188926 0.18886312 0.0290693
  0.03076994 0.02641458]
Gradient Boosting
Pred [1 5 5 ..., 2 2 2]
Mean : 0.475987
Feature Importances [ 0.01622699 0.01101957 0.00386317 0.00910628 0.01038051 0.02331455
  0.01389899 0.03001674 0.0235695 0.08343904 0.18106806 0.08129819
  0.12179502 0.09331017 0.0992527 0.06905604 0.11523619 0.00512814
  0.00045272 0.00856743]
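
For reference, the boosting and extra-trees ensembles listed above can be fitted with the same pattern in scikit-learn. The sketch below uses synthetic data and illustrative settings; in particular, the depth-3 tree used as AdaBoost's base learner is an assumption, since this notebook does not show which base estimator or hyperparameters were used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the descriptor matrix (not Crunchbase data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Base learner for AdaBoost is passed positionally; depth 3 is an assumption.
ensembles = {
    "Adaboost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                                   n_estimators=25, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50,
                                                    random_state=0),
}

scores = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out rows
    print(name, round(scores[name], 3), model.feature_importances_.shape)
```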

Comparison between classifiers

Classifier            Accuracy
knn                   0.837419
Trees                 0.910234
SVM                   0.542404
Logistic Regression   0.425043
Naive Bayes           0.304602
Random Forest         0.909469
Adaboost              0.910289
Extra Trees           0.909833
Gradient Boosting     0.475987
In [21]:
# Accuracies of various classifiers compared to each other
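
The comparison can be reproduced directly from the accuracies reported above. A small sketch that ranks them with pandas follows; the commented-out plotting call is only a suggestion for how the bar chart might be drawn.

```python
import pandas as pd

# Accuracies copied from the comparison table above.
accuracy = pd.Series({
    "knn": 0.837419,
    "Trees": 0.910234,
    "SVM": 0.542404,
    "Logistic Regression": 0.425043,
    "Naive Bayes": 0.304602,
    "Random Forest": 0.909469,
    "Adaboost": 0.910289,
    "Extra Trees": 0.909833,
    "Gradient Boosting": 0.475987,
})

# Rank classifiers from most to least accurate.
ranked = accuracy.sort_values(ascending=False)
print(ranked)

# In the notebook, a horizontal bar chart could be drawn with:
# ranked.plot(kind="barh")
```

The ranking shows the tree-based models (Adaboost, Trees, Extra Trees, Random Forest) clustered near 0.91, knn at about 0.84, and the remaining classifiers below 0.55.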

Analyzing knn, Trees, Random Forest

knn

Trees

Random Forest

Graph Mining

Adjacency, Two-mode Incidence, Similarity and Overlap Network Graphs for investors and the companies they invested in.

In [14]:
from IPython.display import Image, display

# Directory holding the exported investor/company plots (machine-specific path)
path='/Users/Work/Documents/CS249/project/crunchbase/plots/final/png/'
files = ['Acompanies_top110companies.png',
         'Ainvestors_top10companies.png',
         'iIC_top110companies.4.png',
         'Heatmap_Top20Companies.png',
         'Jaccard_Similarity_Companies.png',
         'Jaccard_Similarity_Investors.png',
         'olInvestor1gcustomlayoutRand5.png'
         ]

# Render each plot inline, in the order listed above
for fname in files:
    display(Image(filename='{0}{1}'.format(path, fname)))
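
To make the terminology concrete, the sketch below builds a small two-mode incidence matrix (investors by companies), projects it to a one-mode investor adjacency (overlap) matrix, and computes Jaccard similarity between investors. The investors, companies and links are invented for illustration; the plots above were produced in R with igraph, so this is only a NumPy analogue of the idea.

```python
import numpy as np

# Hypothetical two-mode data: incidence[i, j] = 1 if investor i funded company j.
investors = ["inv_a", "inv_b", "inv_c"]
companies = ["co_1", "co_2", "co_3", "co_4"]
incidence = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

# One-mode projection: entry (i, k) counts companies shared by investors i and k.
overlap = incidence.dot(incidence.T)

# Adjacency matrix of the investor network: same counts, without self-loops.
adjacency = overlap.copy()
np.fill_diagonal(adjacency, 0)

# Jaccard similarity: shared companies divided by companies funded by either investor.
sizes = incidence.sum(axis=1)
union = sizes[:, None] + sizes[None, :] - overlap
jaccard = overlap / union.astype(float)

print(adjacency)
print(jaccard.round(3))
```

Here inv_a and inv_b share two of the three companies either of them funded, giving a Jaccard similarity of 2/3, while inv_a and inv_c share none.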

Adjacency and Two-mode Incidence Graphs for acquirers and the companies acquired.

In [15]:
path='/Users/Work/Documents/CS249/project/crunchbase/plots/final/png/'
files = ['Aacqcompanies_top1000companies.png',
         'Aacquirer_top1000companies.png',
         'iAC_top1000companies.png']

for fname in files:
    display(Image(filename='{0}{1}'.format(path, fname)))

Adjacency, Two-mode Incidence and Overlap Network Graphs for acquirer regions and regions of companies acquired.

In [16]:
path='/Users/Work/Documents/CS249/project/crunchbase/plots/final/png/'
files = ['AacqcompanyRegions_top1000companies.png',
        'AacquirerRegions_top1000companies.png',
        'iACRegions_top1000.png',
        'olAcqRegions1gcustomlayoutTop1000.png']

for fname in files:
    display(Image(filename='{0}{1}'.format(path, fname)))
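
A region-level view like the one plotted above can be summarized as a cross-tabulation of acquirer region against acquired-company region. The sketch below uses invented acquisitions and hypothetical column names (acquirer_region, company_region) to show the idea with pandas.

```python
import pandas as pd

# Hypothetical acquisitions: one row per deal, with the region of each party.
acquisitions = pd.DataFrame({
    "acquirer_region": ["SF Bay", "SF Bay", "New York", "Seattle", "SF Bay"],
    "company_region": ["SF Bay", "New York", "New York", "SF Bay", "SF Bay"],
})

# Rows are acquirer regions, columns are acquired-company regions,
# entries count the deals between each pair of regions.
region_matrix = pd.crosstab(acquisitions["acquirer_region"],
                            acquisitions["company_region"])
print(region_matrix)

# Share of deals where acquirer and target are in the same region.
same_region_share = (acquisitions["acquirer_region"]
                     == acquisitions["company_region"]).mean()
print(same_region_share)
```

The count matrix plays the role of a weighted adjacency matrix between regions, which is the kind of input a region overlap graph would be drawn from.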